All You Need to Know about Scheduling Deep Learning Jobs
Author
Abstract
With the recent breakthroughs in deep neural networks, an emerging class of data centers uses accelerated hardware to support efficient training of neural network models [1, 2]. The accelerated hardware (e.g., GPUs, FPGAs, TPUs [9], Cambricon [11]), interconnected with high-speed networks (e.g., InfiniBand) and coupled with large training data [7], provides orders-of-magnitude training speedups. In this paper, we study the new resource-management challenges that arise from the characteristics of deep learning workloads in a cluster with accelerated hardware. The first challenge is to find an extensible resource abstraction that can represent the diverse and fast-evolving accelerated devices. A deep learning job should be able to learn the resource types and their usage, and to request a certain type of device with a specific topology requirement. In Section 2, we describe the hardware configuration of a typical data center for deep learning and propose a resource abstraction that addresses this challenge. The second challenge results from a tension in deep learning job scheduling. For multiple deep learning jobs, we find that the system should "spread" them apart to avoid mutual interference, while for a large deep learning job that requires multiple accelerated devices, the system should "pack" it onto devices that are close to one another to avoid a significant loss of training speed. Spreading leads to fragmented usage of the accelerated devices, whereas packing requires consecutive free slots on the devices. In Section 3, we quantify the effects of job interference and demonstrate the significant performance difference for a large job under different locality settings. We then discuss possible ways to resolve the tension between job spreading and packing.
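To make the two challenges concrete, the sketch below shows one way a topology-aware resource abstraction and a packing policy might be expressed. It is a minimal illustration under assumed names (Device, distance, pack) and an assumed two-level PCIe/NUMA topology; it is not the abstraction actually proposed in Section 2.

# Minimal sketch of a topology-aware resource abstraction for accelerators.
# The fields, the distance metric, and the brute-force search are illustrative
# assumptions, not the design from Section 2 of the paper.
from dataclasses import dataclass
from itertools import combinations
from typing import Optional

@dataclass(frozen=True)
class Device:
    dev_id: int
    dev_type: str      # e.g. "GPU", "FPGA", "TPU"
    numa_node: int     # CPU socket the device attaches to
    pcie_switch: int   # devices under the same switch share a fast path

def distance(a: Device, b: Device) -> int:
    """Toy distance: 0 = same PCIe switch, 1 = same NUMA node, 2 = cross-socket."""
    if a.pcie_switch == b.pcie_switch:
        return 0
    if a.numa_node == b.numa_node:
        return 1
    return 2

def pack(free: list, dev_type: str, count: int) -> Optional[list]:
    """Pick `count` free devices of `dev_type` with minimal pairwise distance.
    Brute force for clarity; a real scheduler would walk the topology tree."""
    candidates = [d for d in free if d.dev_type == dev_type]
    if len(candidates) < count:
        return None  # not enough slots of this type: the fragmentation problem
    return list(min(
        combinations(candidates, count),
        key=lambda g: sum(distance(a, b) for a, b in combinations(g, 2)),
    ))

# Example: a two-socket machine with two GPUs per PCIe switch.
gpus = [Device(i, "GPU", numa_node=i // 4, pcie_switch=i // 2) for i in range(8)]
print(pack(gpus, "GPU", 4))  # prefers the four GPUs sharing a NUMA node

The same distance metric can express spreading: placing small independent jobs so that their pairwise distance is maximized keeps them off shared switches and links. The tension discussed in Section 3 is that such spreading fragments exactly the consecutive slots that pack needs for large jobs.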
Similar resources
Teaching and Learning About Evolution
The National Association of Biology Teachers considers evolution to be the foundation for middle school life science. In the National Science Education Standards (NSES), evolution is an essential component of the science curriculum at all grade levels. With this book as your guide, your charge is to help youth learn about evolution as the unifying theme of the life sciences. How do you guid...
Operation Scheduling of MGs Based on Deep Reinforcement Learning Algorithm
In this paper, the operation scheduling of Microgrids (MGs), including Distributed Energy Resources (DERs) and Energy Storage Systems (ESSs), is addressed using a Deep Reinforcement Learning (DRL) based approach. Due to the dynamic characteristics of the problem, it is first formulated as a Markov Decision Process (MDP). Next, the Deep Deterministic Policy Gradient (DDPG) algorithm is presented t...
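As a rough illustration of the MDP formulation described above, the sketch below casts a single-storage microgrid as an environment with a continuous charge/discharge action, the setting DDPG (an actor-critic method for continuous action spaces) is designed for. The state variables, toy load/solar profiles, and reward here are assumptions for illustration, not the paper's model.

# Schematic MDP for microgrid dispatch; all quantities are illustrative.
import numpy as np

class MicrogridEnv:
    """Toy environment: one step = one hourly dispatch interval."""

    def __init__(self, horizon: int = 24):
        self.horizon = horizon
        self.reset()

    def reset(self) -> np.ndarray:
        self.t = 0
        self.soc = 0.5  # storage state of charge in [0, 1]
        return self._state()

    def _forecast(self, t: int):
        # Placeholder load and solar profiles standing in for real forecasts.
        load = 0.6 + 0.3 * np.sin(2 * np.pi * t / self.horizon)
        solar = float(np.sin(np.pi * (t - 6) / 12)) if 6 <= t <= 18 else 0.0
        return load, solar

    def _state(self) -> np.ndarray:
        load, solar = self._forecast(self.t)
        return np.array([self.t / self.horizon, self.soc, load, solar])

    def step(self, action: float):
        # Continuous action in [-1, 1]: charge (>0) or discharge (<0) the ESS.
        a = float(np.clip(action, -1.0, 1.0))
        load, solar = self._forecast(self.t)
        self.soc = float(np.clip(self.soc + 0.1 * a, 0.0, 1.0))
        grid_import = max(0.0, load - solar + 0.1 * a)
        reward = -grid_import  # minimize energy bought from the grid
        self.t += 1
        return self._state(), reward, self.t >= self.horizon

A DDPG agent would then learn a deterministic policy network mapping the state vector to the continuous action, with a critic network estimating the action value; the continuous action space is why DDPG fits here better than discrete-action methods such as DQN.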
Solving the Problem of Scheduling Unrelated Parallel Machines with Limited Access to Jobs
Nowadays, with the successful application of the on-time production concept in areas such as production management and inventory, the need to complete the processing of jobs by their due dates is considered a key issue in industrial environments. Unrelated parallel machine scheduling is a generalization of the classic parallel machine scheduling problem. In some applications of unrelated parallel mac...
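The snippet above is cut off, but the problem it names has a standard greedy baseline: assign each job to the eligible machine on which it would finish earliest. The sketch below is that generic heuristic under assumed data structures, where "limited access" is modeled as per-job machine eligibility; the paper's actual model and solution method may differ.

# Greedy heuristic for unrelated parallel machines with eligibility
# restrictions; data and names are illustrative.

def schedule(proc_times: dict, eligible: dict, machines: range) -> dict:
    """proc_times[(job, machine)] = processing time of job on that machine;
    eligible[job] = set of machines allowed to process the job."""
    finish = {m: 0.0 for m in machines}  # current load per machine
    assignment = {}
    # Longest-minimum-processing-time-first ordering tends to balance load.
    jobs = sorted(eligible,
                  key=lambda j: -min(proc_times[(j, m)] for m in eligible[j]))
    for job in jobs:
        m = min(eligible[job], key=lambda m: finish[m] + proc_times[(job, m)])
        assignment[job] = m
        finish[m] += proc_times[(job, m)]
    return assignment

# Example: 3 jobs, 2 machines, job "c" restricted to machine 1.
pt = {("a", 0): 2.0, ("a", 1): 3.0,
      ("b", 0): 4.0, ("b", 1): 1.0,
      ("c", 0): 9.9, ("c", 1): 2.5}
print(schedule(pt, {"a": {0, 1}, "b": {0, 1}, "c": {1}}, range(2)))

Running the example assigns job "c" to its only eligible machine and balances the remaining jobs across both machines.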
Welcome to virosphere
Viruses may seem alien, but they are the most abundant and, arguably, the most important organisms on Earth. They are found just about everywhere, from oceans and forests to the people around you and, of course, in and on you as well. This world of strange, quasi-living things has been dubbed the virosphere, and it is a mysterious one – we know less about viruses than any other life form. But t...
Online Job Scheduling in Distributed Machine Learning Clusters
Nowadays, large-scale distributed machine learning systems have been deployed to support various analytics and intelligence services in IT firms. To train on a large dataset and derive a prediction/inference model, e.g., a deep neural network, multiple workers run in parallel, training on partitions of the input dataset and updating shared model parameters. In a shared cluster handling multiple tr...
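To ground the worker/shared-parameter pattern the snippet describes, here is a toy data-parallel loop: each worker computes a gradient on its own data shard, and the shared parameters take the averaged update. It is a single-process illustration with synthetic data; real clusters run the workers as separate processes communicating with parameter servers.

# Toy data-parallel SGD: workers compute per-shard gradients, the shared
# parameter vector takes the averaged update. Purely illustrative.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
true_w = rng.normal(size=10)
y = X @ true_w + 0.01 * rng.normal(size=1000)

n_workers, lr = 4, 0.1
partitions = np.array_split(np.arange(len(X)), n_workers)  # one shard per worker
w = np.zeros(10)                                           # shared parameters

def worker_gradient(shard: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Least-squares gradient computed on this worker's partition only."""
    Xi, yi = X[shard], y[shard]
    return 2 * Xi.T @ (Xi @ w - yi) / len(shard)

for step in range(200):
    grads = [worker_gradient(p, w) for p in partitions]  # parallel in practice
    w -= lr * np.mean(grads, axis=0)                     # parameter-server update

print("parameter error:", np.linalg.norm(w - true_w))

In a shared cluster, many such jobs contend for worker slots at once, which is the online scheduling setting this related paper studies.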